1. INTRODUCTION


Network Distributed Global Memory (NDGM) is implemented on top of a message passing layer
called Message Relay System (MRS). MRS presents a common interface for sending and receiving
messages over several different interfaces: Transmission Control Protocol/Internet Protocol (TCP/IP),
Shared Memory Arena, FIFOs, and "stdio" file pointers. NDGM sends messages via MRS to copy sections
of local memory to and from the global address space.

In addition to data transfer, NDGM provides mechanisms for program coordination. These include: 

Semaphores: Allow one node exclusive access to some service
Memory Locks: Allow one node exclusive access to an area of memory 
Barriers: Allow multiple nodes to block until all have reached a certain point 

NDGM is implemented using a client-server model. 

MRS allows processes on heterogeneous machines to communicate via message passing. Each node 
in MRS can choose from several different physical transport mechanisms for its data. In Figure 1, nodes 
on the same machine communicate via shared memory, while communication across machines is 
accomplished via TCP/IP. 

In addition to direct message passing, MRS allows nodes to send their messages indirectly. For
example, node 1 can communicate with node 3 by relaying its message through nodes 2 and 4. Also, a
message may be broadcast to all known nodes. For example, node 1 could send the same message to
nodes 2, 3, and 4 by sending one broadcast message to node 2.

The maximum size of the message is determined when the node is created and is application defined. 
It is also possible to directly access a node’s data space in order to minimize buffering.

Figure 1. Message relay system basics.

As shown in Figure 2, an application might define some custom structure to pass as messages that
contain an opcode and some arguments. By assigning a structure pointer to the data area of the node, the
message data can be accessed as any arbitrary structure. In addition, the data does not have to be copied
into the MRS message buffer since it is already in the proper position. Such a code fragment might look
like this:

typedef struct {
    int   opcode;
    int   array_size;
    float data[1];      /* first element; the rest follows in the message area */
} MY_STRUCT;

MY_STRUCT *sptr;
MRS_NODE  *node;
int       i;

/* Open the node */
node = mrs_open(...);

/* Assign sptr to the node's data */
sptr = (MY_STRUCT *)MRS_NODE_DATA(node);

/* Fill it up */
for (i = 0; i < 1000; i++) {
    sptr->data[i] = 5.0 * i;
}

/* Send the header and the floats directly from the node's data area */
mrs_send(node, sizeof(MY_STRUCT) + (1000 * sizeof(float)), (char *)sptr);
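
The overlay technique in the fragment above can be tried without MRS. The following is a minimal
sketch, not MRS code: the helper fill_message() is hypothetical, the message area is any
caller-provided buffer (for example one obtained from calloc()), and a C99 flexible array member
stands in for the data[1] idiom of the original. It builds the message in place and returns the
exact number of bytes that would be handed to a send routine.

```c
#include <stddef.h>

typedef struct {
    int   opcode;
    int   array_size;
    float data[];       /* C99 flexible array member (the report uses data[1]) */
} MY_STRUCT;

/*
 * Hypothetical helper: build a message of n floats directly inside a
 * caller-provided, suitably aligned message area. Returns the message
 * length in bytes, or 0 if n is negative or the area is too small.
 */
size_t fill_message(void *area, size_t area_bytes, int opcode, int n)
{
    MY_STRUCT *sptr = (MY_STRUCT *)area;
    size_t need;
    int i;

    if (n < 0)
        return 0;

    /* Header up to the start of data[], plus the payload itself */
    need = offsetof(MY_STRUCT, data) + (size_t)n * sizeof(float);
    if (need > area_bytes)
        return 0;

    sptr->opcode = opcode;
    sptr->array_size = n;
    for (i = 0; i < n; i++)
        sptr->data[i] = 5.0f * (float)i;

    return need;
}
```

Because the structure is written where the message already lives, no second copy into a transport
buffer is needed, which is the point the text makes about minimizing buffering.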

An MRS node, shown in Figure 3, provides a convenient abstraction that allows message passing to
be accomplished across several low-level mechanisms. Each node contains a unique ID, the hostname on
which it was created, and the type of processor of that machine. The node contains function pointers to
routines that provide the actual data transport. Currently, transport is provided for TCP/IP, a shared
memory queue mechanism, a generic memory buffer, and stdio file pointers such as FIFOs.

The MRS node also points to a message area. This area can be provided by the user, or the MRS
library will allocate one via calloc(). When a message is received, it contains the owner (originator) of
the message, the last node to handle the message, and an intended delivery path. If the receiving node
can connect to the destination node, the intended path is ignored and the message is delivered directly.

Figure 3. Message relay system node information.

There are also "broadcast" messages. These are messages that, once received, are broadcast to all 
known nodes. 
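
The relay rule described above (deliver directly when possible, otherwise follow the intended path)
can be sketched as follows. This is an illustration under stated assumptions, not the MRS
implementation: the MSG_HEADER layout, the links connectivity table, and the function next_hop()
are all hypothetical names introduced here.

```c
#define MAX_NODES 8
#define MAX_PATH  8

/* Hypothetical message header; field names are assumptions */
typedef struct {
    int owner;              /* originator of the message */
    int last;               /* last node to handle the message */
    int dest;               /* final destination */
    int path[MAX_PATH];     /* intended delivery path after the owner, ending at dest */
    int path_len;
} MSG_HEADER;

/*
 * Decide where node `self` should forward the message.
 * links[a][b] is nonzero when node a can connect directly to node b.
 * Returns the next node ID, or -1 if no route is known.
 */
int next_hop(const MSG_HEADER *m, int self, unsigned char links[MAX_NODES][MAX_NODES])
{
    int i, pos = -1;

    /* A direct connection overrides the intended path */
    if (links[self][m->dest])
        return m->dest;

    /* Otherwise find our position in the intended path */
    if (self != m->owner) {
        for (i = 0; i < m->path_len; i++) {
            if (m->path[i] == self) {
                pos = i;
                break;
            }
        }
        if (pos < 0)
            return -1;      /* not on the path and no direct link */
    }

    if (pos + 1 >= m->path_len)
        return -1;          /* path exhausted */
    return m->path[pos + 1];
}
```

With the example from the text (node 1 reaching node 3 through nodes 2 and 4), the path would be
{2, 4, 3}: node 1 forwards to 2, node 2 forwards to 4, and node 4 delivers directly to 3.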

Since the node’s data space can be assigned by the application, many nodes can share the same
memory area if the application knows that only one node will access the area at a time.

In Figure 4, for example, if an application needed to maintain 70 TCP/IP connections, each capable
of a one megabyte message, the application could assign the same data buffer to all of the nodes. As long
as data is copied out of the data buffer if needed for later use, it is not necessary to maintain a separate
buffer for each node.

Figure 4. MRS nodes sharing data area.


2. NETWORK DISTRIBUTED GLOBAL MEMORY (NDGM)


NDGM is a layer of routines on top of MRS that frees the application from much of the bookkeeping
of message passing. An application communicates with others by writing and reading data into a virtual
space. Even though the memory physically resides on several distributed machines, it is accessed as a
contiguous data buffer. The NDGM library manages the details of accessing this global memory.

The actual layout of the memory is described in a file that is given to the shell script ndgm_start.
Each Node line in the file gives a hostname and a size in bytes. Lines that start with "#" are ignored:


# NDGM Description File
Node cpu1.arl.army.mil 5000000
Node cpu2.arl.army.mil 20000000
Node cpu3.arl.army.mil 10000000

The shell script ndgm_start will execute an rsh to each of these hosts and start a server program.
Once all of the servers have been started, they are all given the mapping of all of the servers in the
system. In this way, all of the servers know the assigned starting and end address of every other server.
This mapping is also written to the file ndgm_current_net.dat.
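
The server mapping can be pictured as a running sum over the sizes in the description file. The
sketch below is an assumption about how such a table could be built, not the NDGM source: the
SERVER_MAP structure and build_map() are hypothetical, servers are assumed to be laid out in file
order starting at global address 0, and the end address is taken to be inclusive.

```c
/* Hypothetical mapping entry; one per Node line in the description file */
typedef struct {
    const char    *hostname;
    unsigned long  size;    /* bytes served by this host (must be > 0) */
    unsigned long  start;   /* first global address on this host */
    unsigned long  end;     /* last global address on this host (inclusive) */
} SERVER_MAP;

/*
 * Fill in start and end for n servers laid out back to back.
 * Returns the total size of the global address space in bytes.
 */
unsigned long build_map(SERVER_MAP *map, int n)
{
    unsigned long next = 0;
    int i;

    for (i = 0; i < n; i++) {
        map[i].start = next;
        map[i].end = next + map[i].size - 1;
        next += map[i].size;
    }
    return next;
}
```

For the three-host description file above, this layout gives a 35,000,000-byte global space with
the second host covering addresses 5,000,000 through 24,999,999.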

To use the virtual buffer created by ndgm_start, an application makes a call to ndgm_init(). This reads
the system description from one of the servers and assigns a unique node ID. This ID stays constant for
the duration of the application. The system description defines the mapping from the global address space
to local machine offsets.

Each of the servers started by ndgm_start waits for requests from clients to access its data. The server
is continuously executing a blocking read, so little CPU time is being consumed. If the server was started
by root, it automatically tries to lock its data into core memory so that it does not get swapped out by the
operating system. The server will not exit until it receives an NDGM_TERM command from a client.

The system can be terminated by running the ndgm_stop shell script. All servers started up by
ndgm_start are sent an NDGM_TERM command.

As in Figure 5, NDGM sets up an arbitrary-size virtual memory array. This memory area is physically
distributed across several machines, but is accessed as one continuous memory block. There are routines
to put data into global memory and to get data from global memory. These routines access the data as
a contiguous block of bytes and not as any particular data type, in much the same style as memcpy().
This leaves the actual use of the area application defined.

The actual transport of the data is accomplished through MRS and is transparent to the application.
If the requested area spans several physical machines, the application need not be aware of its layout.
The access routines handle all of the necessary message passing.
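
How an access that spans several machines might be split is sketched below. This is a stand-in
under stated assumptions, not the NDGM access routine: the BLOCK structure and sketch_put() are
hypothetical, the blocks are assumed to be sorted by start address and contiguous, and a plain
memcpy() into an in-process buffer takes the place of the MRS message a real client would send to
each server.

```c
#include <string.h>

/* Hypothetical description of one server's share of the global space */
typedef struct {
    unsigned long  start;   /* first global address served */
    unsigned long  size;    /* number of bytes served */
    unsigned char *mem;     /* stand-in for the server's memory */
} BLOCK;

/*
 * Copy `length` bytes from `source` to global address `address`,
 * splitting the request at block boundaries.
 * Returns the number of bytes actually stored.
 */
unsigned long sketch_put(BLOCK *blocks, int nblocks, unsigned long address,
                         const void *source, unsigned long length)
{
    const unsigned char *src = (const unsigned char *)source;
    unsigned long done = 0;
    int i;

    for (i = 0; i < nblocks && done < length; i++) {
        unsigned long first = address + done;
        unsigned long offset, chunk;

        /* Skip blocks that do not contain the next byte to store */
        if (first < blocks[i].start || first >= blocks[i].start + blocks[i].size)
            continue;

        offset = first - blocks[i].start;       /* local offset on this server */
        chunk = blocks[i].size - offset;        /* room left in this block */
        if (chunk > length - done)
            chunk = length - done;

        /* A real client would send one message to this server here */
        memcpy(blocks[i].mem + offset, src + done, chunk);
        done += chunk;
    }
    return done;
}
```

The caller sees one contiguous range; only the loop knows that a six-byte write starting three
bytes before a boundary becomes two separate transfers.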

The physical memory for each block of data is allocated from shared memory. This allows fast access
by an application to a block that is on the same physical machine. Once again, the access routines take
care of detecting this situation.

Figure 5. NDGM system.

In addition to data access routines, NDGM contains synchronization mechanisms: memory locks,
semaphores, and barriers. These mechanisms are implemented in the NDGM server and do not use the
associated operating system mechanisms.

Memory locks allow an application to obtain an advisory lock on any section of the global memory. 
These locks do not restrict access to the data but prevent other locks on the same memory area from being 
obtained until the original lock is released. 

Memory locks are always blocking. This means that a call to lock a section of memory will not return
until the entire section has been successfully locked or upon system error. The NDGM server maintains
a list of pending requests and grants memory locks when the resource becomes available.
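
The server-side decision behind an advisory memory lock comes down to a range-overlap test. The
sketch below is illustrative only and is not taken from the NDGM server: LOCK, LOCK_TABLE,
lock_request(), and lock_release() are hypothetical names, the table size is arbitrary, and the
pending-request list is reduced to a return code that tells the caller the request must wait.

```c
#define MAX_LOCKS 16

/* Hypothetical record of one granted lock */
typedef struct {
    unsigned long address;
    unsigned long length;
    int           owner;
    int           used;
} LOCK;

typedef struct {
    LOCK held[MAX_LOCKS];
} LOCK_TABLE;

/* Nonzero when [address, address+length) intersects the held lock */
static int overlaps(const LOCK *l, unsigned long address, unsigned long length)
{
    return address < l->address + l->length && l->address < address + length;
}

/* Returns 1 if granted, 0 if the request must wait, -1 if the table is full */
int lock_request(LOCK_TABLE *t, int owner, unsigned long address, unsigned long length)
{
    int i, free_slot = -1;

    for (i = 0; i < MAX_LOCKS; i++) {
        if (!t->held[i].used) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        if (overlaps(&t->held[i], address, length))
            return 0;       /* a real server would queue the request here */
    }
    if (free_slot < 0)
        return -1;

    t->held[free_slot].address = address;
    t->held[free_slot].length = length;
    t->held[free_slot].owner = owner;
    t->held[free_slot].used = 1;
    return 1;
}

/* Returns 1 if a matching lock was released, 0 otherwise */
int lock_release(LOCK_TABLE *t, int owner, unsigned long address, unsigned long length)
{
    int i;

    for (i = 0; i < MAX_LOCKS; i++) {
        LOCK *l = &t->held[i];
        if (l->used && l->owner == owner && l->address == address && l->length == length) {
            l->used = 0;
            return 1;
        }
    }
    return 0;
}
```

Nothing in the table stops a client from reading or writing the locked range, which matches the
advisory nature of the locks: only other lock requests on overlapping ranges are held back.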

Semaphores are implemented in much the same way as memory locks. Only one client can obtain
a specified semaphore at any time; all others are blocked until the current owner of the semaphore releases
it.

Barriers are used to coordinate activity between several processes. One process sets an initial barrier
value. Each process that subsequently checks into the barrier will decrement that value. When the barrier
value reaches zero, all processes that have checked in are notified. The barrier value is then automatically
reset.
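
The barrier bookkeeping described above amounts to a countdown that rearms itself. A minimal
sketch follows, assuming a hypothetical BARRIER record kept by the server; the names
barrier_init() and barrier_check_in() are introduced here for illustration and are not the NDGM
routines.

```c
/* Hypothetical server-side state for one barrier */
typedef struct {
    int initial;    /* value set by the initializing process */
    int remaining;  /* check-ins still needed before release */
    int waiting;    /* processes currently blocked in the barrier */
} BARRIER;

void barrier_init(BARRIER *b, int value)
{
    b->initial = value;
    b->remaining = value;
    b->waiting = 0;
}

/*
 * Record one check-in. Returns 0 while the barrier is still closed,
 * or the number of processes to notify once the count reaches zero.
 * The barrier value is reset automatically on release.
 */
int barrier_check_in(BARRIER *b)
{
    int released;

    b->waiting++;
    b->remaining--;
    if (b->remaining > 0)
        return 0;

    released = b->waiting;
    b->remaining = b->initial;
    b->waiting = 0;
    return released;
}
```

Because the count is restored on release, the same barrier can be reused every timestep without
being initialized again.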


Access to global memory and synchronization mechanisms is accomplished through the following
routines:


int
ndgm_init(ndgm_node_id, hostname, mrs_node_id, verbose)
int             ndgm_node_id
char            *hostname               Connect to Global Memory
int             mrs_node_id
int             verbose

int
ndgm_put(address, source, length)
NDGM_ADDR       address
void            *source                 Puts Data Into Global Memory
NDGM_LENGTH     length

int
ndgm_get(address, destination, length)
NDGM_ADDR       address
void            *destination            Gets Data From Global Memory
NDGM_LENGTH     length

int
ndgm_lock(address, length)
NDGM_ADDR       address                 Obtains Memory Lock
NDGM_LENGTH     length

int
ndgm_unlock(address, length)
NDGM_ADDR       address                 Unlocks a Section of Global Memory
NDGM_LENGTH     length

int
ndgm_sema_get(sema_id)
NDGM_KEY        sema_id                 Obtain Semaphore

void
ndgm_sema_release(sema_id)
NDGM_KEY        sema_id                 Release Semaphore

int
ndgm_barrier_init(barrier_id, value)
NDGM_KEY        barrier_id              Initialize Barrier
int             value

int
ndgm_barrier_wait(barrier_id)
NDGM_KEY        barrier_id              Check Into a Barrier

int
ndgm_dump(filename, address, length)
char            *filename               Dump Memory Image to
NDGM_ADDR       address                 File (Parallel I/O)
NDGM_LENGTH     length

int
ndgm_undump(filename, address, length)
char            *filename               Read Memory Image From
NDGM_ADDR       address                 File (Parallel I/O)
NDGM_LENGTH     length


3. DZONAL: AN APPLICATION 


Dzonal is a distributed version of "The Zonal Code" by Dr. N. Patel (Patel, Sturek, and Smith
1989). The purpose here is not to document the Dzonal code but to show how NDGM was used to
develop a distributed application and utilities.

Dzonal is a full three-dimensional, Navier-Stokes flow solver for supersonic flow. A copy of the
Dzonal executable is run on multiple machines and coordinated through the use of barriers. Unix shell
scripts are used to decompose the computational domain into fairly even chunks and to start the
application on the remote machine. There is no explicit message passing. Rather, all coordination is
accomplished through NDGM.

As an example, assume there is a geometry with two blocks (Figure 6). The first is 20x20x20 and
the second is 30x20x30. The two blocks overlap along the I dimension at I=[19,20] of block 1 and
I=[1,2] of block 2.

In addition, assume that there are five identical processors on which to distribute the application. Each
is a workstation with the same amount of main memory and disk space. Their hostnames are CPU1,
CPU2, CPU3, CPU4, and CPU5. With this arrangement, the layout (chosen by the Dzonal domain
decomposition utility) might look like Table 1.


Table 1. Domain Decomposition With Identical Processors

Processor   Block   I start   I end   J start   J end   K start   K end
cpu1        1       1         20      1         20      1         11
cpu2        1       1         20      1         20      10        20
cpu3        2       1         30      1         20      1         11
cpu4        2       1         30      1         20      10        21
cpu5        2       1         30      1         20      20        30


Notice that there is a 1-K plane overlap among processors. This allows each processor to compute
only on interior points; inner-block boundaries are communicated each timestep.

With this layout, the exact amount of global memory is assigned to each processor that will allow its
interior points to be assigned to that processor's local memory. For example, CPU4 has K planes 10-21
of block 2. Planes 10 and 21, however, are interior to CPU3 and CPU5, so they are assigned to those
CPUs respectively. CPU4 will have K planes 11-20 in its local memory. With a 32-bit floating point
number, and assuming there is a need to store 50 values for each grid node (X, Y, Z, Temp, Press, ...),
CPU4 would be assigned [10 planes * 30(I) * 20(J) * 50 values * 4 bytes] = 1.2 Megabytes of local
memory. This local memory would then be mapped to some global address (CPU1 would start at 0). The
global addresses assigned to CPU4 will be called Add_B2_K11 through Add_B2_K20.
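
The memory figure for CPU4 can be checked with a few lines of arithmetic. The sketch below is only
a worked example: the function names are hypothetical, and the assumption that K planes are stored
one after another within a processor's region is made here for illustration and is not stated in
the text.

```c
/* Bytes in one K plane: ni * nj grid nodes, nvalues values per node */
unsigned long plane_bytes(int ni, int nj, int nvalues, unsigned long value_bytes)
{
    return (unsigned long)ni * (unsigned long)nj * (unsigned long)nvalues * value_bytes;
}

/* Bytes of local memory for K planes k_first..k_last (inclusive) */
unsigned long local_bytes(int k_first, int k_last, unsigned long pbytes)
{
    return (unsigned long)(k_last - k_first + 1) * pbytes;
}

/* Offset of plane k from the start of the processor's region,
 * assuming planes are stored consecutively (an assumption) */
unsigned long plane_offset(int k, int k_first, unsigned long pbytes)
{
    return (unsigned long)(k - k_first) * pbytes;
}
```

For block 2, plane_bytes(30, 20, 50, 4) is 120,000 bytes, so planes 11 through 20 occupy
1,200,000 bytes, in agreement with the 1.2 Megabytes quoted above. Under the consecutive-storage
assumption, plane 20 would begin 1,080,000 bytes past the address of plane 11.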

For each timestep, CPU4 will calculate values for its interior points and possibly any global boundary
conditions (e.g., a wall at J = 1, outflow at I = 30, etc.), then write those values to global memory. Since
all of these values are in local memory, this is a fast write operation. Once CPU4 has output its values,
it waits in a barrier. When all CPUs have checked into the barrier (they have all computed and output
their data), they can then read back the information for their boundaries. This includes the cross-block
communication for the overlap of block 1 and block 2. While this communication scheme might get quite
complicated if explicit message passing is used, it is quite straightforward when viewed as a single shared
memory.

Figure 6. Example Dzonal application.

Nodes that contain absolute boundaries then apply the appropriate boundary conditions and write that 
data to global memory. When all nodes check into the final barrier, the application continues to the next 
timestep. The local to global mapping of this application would look like Figure 7. 

Since the NDGM servers are separate processes accessed by the Dzonal clients, other clients can access
the global memory while the solution is developing. This allows quite useful debugging utilities to be
developed. Two such utilities that have been developed are dz_look and dz_draw_plane. Dz_look allows
the user to look at any value in global memory by entering its IJK value. Dz_draw_plane will pull any
subsection of any plane (I, J, or K) out of global memory and send it to a network visualization program
called Bop_View (Clarke 1994). A fully operational system might look like Figure 8.

Figure 7. Dzonal processes.

Figure 8. Operational system.


4. CONCLUSIONS


NDGM allows applications to view a physically distributed group of processors as a shared memory
parallel machine. Although many factors determine communication speed, such as CPU and network load,
the following numbers are good "ballpark" numbers for transfer rates:

Transfer to local NDGM server       6 MBytes/second
Transfer to remote NDGM server      4 KBytes/second

These numbers are averages on a network of Silicon Graphics Indigo workstations run during peak 
and nonpeak hours. They include different transfer lengths and all setup overhead. Your mileage may 
vary. 

Assuming these transfer rates are sufficient, NDGM can be used to develop and run parallel
applications on networks of relatively low-cost platforms. When the time spent in a batch queue is taken
into account, the total wall clock time may be comparable to larger, more expensive platforms. The design
goal with NDGM is to minimize (dollars/grid node) while maintaining an acceptable (wall clock time/grid
node).




5. REFERENCES 


Clarke, Jerry. "Remote Data Transfer (RDT): An Interprocess Data Transfer Method for Distributed
Environments." BRL-TR-3339, U.S. Army Ballistic Research Laboratory, Aberdeen Proving Ground,
MD, May 1992.

Clarke, Jerry A. "Distributed Heterogeneous Visualization: Bop and Bop View." ARL-CR-172, U.S.
Army Research Laboratory, Aberdeen Proving Ground, MD, September 1994.

Dykstra, Phillip C. "The BRL CAD Package, An Overview." U.S. Army Ballistic Research Laboratory,
Aberdeen Proving Ground, MD, October 1988.

Muuss, Michael. "Workstations, Networking, Distributed Graphics, and Parallel Processing." U.S. Army
Ballistic Research Laboratory, Aberdeen Proving Ground, MD, October 1988.

Patel, N., W. Sturek, and G. Smith. "Parallel Computation of Supersonic Flow Using a
Three-Dimensional Navier-Stokes Code." BRL-TR-30XX, U.S. Army Ballistic Research Laboratory,
Aberdeen Proving Ground, MD, November 1988.

"XDR: External Data Representation Standard." RFC-1014, DDN Network Information Center, Menlo
Park, CA, June 1987.










ARL Report Number ARL-CR-173, Date of Report October 1994













